A Context-Aware Topic Model for Statistical Machine Translation
Abstract
Lexical selection is crucial for statistical machine translation. Previous studies exploit sentence-level contexts and document-level topics for lexical selection separately, neglecting their correlations. In this paper, we propose a context-aware topic model for lexical selection, which not only models local contexts and global topics but also captures their correlations. The model uses target-side translations as hidden variables to connect document topics and source-side local contextual words. To learn hidden variables and distributions from data, we introduce a Gibbs sampling algorithm for statistical estimation and inference. A new translation probability based on distributions learned by the model is integrated into a translation system for lexical selection. Experimental results on NIST Chinese-English test sets demonstrate that 1) our model significantly outperforms previous lexical selection methods and 2) modeling correlations between local words and global topics can further improve translation quality.
Similar Resources
A Topic-Triggered Language Model for Statistical Machine Translation
The language model is an essential part of statistical machine translation, but traditional n-gram language models capture only a limited local context in the translated sentence and thus lack global information for prediction. This paper describes a novel topic-triggered language model, which takes into account the topical context by estimating the n-gram probability under the given topics...
A new model for Persian multi-part words edition based on statistical machine translation
Multi-part words in English are hyphenated, with the hyphen separating the different parts. Persian also has multi-part words. According to Persian morphology, a half-space character is needed to separate the parts of a multi-part word, but in many cases people incorrectly use a space character instead. This common incorrect use of the space character leads to some s...
Using Features from Topic Models to Alleviate Over-Generation in Hierarchical Phrase-Based Translation
In hierarchical phrase-based translation systems, the grammars (SCFG rules) have an over-generation problem because the non-terminal X can be replaced with almost anything without knowing the syntactic or semantic role of X. In this paper, we present an approach that uses topic models to learn the distributions for non-terminals in each SCFG rule, based on which we further derive static features f...
Virtual Babel: Towards Context-Aware Machine Translation in Virtual Worlds
In this paper, we describe our ongoing research project, Virtual Babel, a context-aware machine translation system for Second Life, one of the most popular virtual worlds. We augment the Second Life viewer to intercept incoming and outgoing chat messages and reroute them to a statistical machine translation server. The returned translations are appended to the original text message to h...
A Ranker for Semantic Spell Checking Using Context-Sensitive Features
Nowadays, a large volume of documents is generated daily. These documents are written by different people and therefore contain spelling errors, which reduce their quality. Automatic writing assistance tools such as spell checkers and correctors can therefore help improve document quality. Context-sensitive errors are misspelled words that have been...